Abstract

We establish an excess risk bound of $\tilde{O}\left(H\mathcal{R}_n^2 + \sqrt{HL^*}\,\mathcal{R}_n\right)$ for ERM with an $H$-smooth loss function
and a hypothesis class with Rademacher complexity $\mathcal{R}_n$, where $L^*$ is the best risk achievable by the
hypothesis class. For typical hypothesis classes where $\mathcal{R}_n = \sqrt{R/n}$, this translates to a learning rate of
$\tilde{O}(RH/n)$ in the separable ($L^* = 0$) case and $\tilde{O}\left(RH/n + \sqrt{L^*RH/n}\right)$ more generally. We also provide
similar guarantees for online and stochastic convex optimization of a smooth non-negative objective.

1 Introduction 
Consider empirical risk minimization for a hypothesis class $\mathcal{H} \subseteq \{h : \mathcal{X} \to \mathbb{R}\}$ w.r.t. some non-negative loss
function $\phi(t, y)$. That is, we would like to learn a predictor $h$ with small risk

$$L(h) = \mathbb{E}\left[\phi(h(X), Y)\right]$$

by minimizing the empirical risk

$$\hat{L}(h) = \frac{1}{n}\sum_{i=1}^{n}\phi(h(x_i), y_i)$$

of an i.i.d. sample $(x_1, y_1), \ldots, (x_n, y_n)$.
Statistical guarantees on the excess risk are well understood for parametric (i.e. finite dimensional) hypothesis
classes. More formally, these are hypothesis classes with finite VC-subgraph dimension [24] (aka pseudo-dimension).
For such classes learning guarantees can be obtained for any bounded loss function (i.e. s.t. $|\phi| \le b < \infty$)
and the relevant measure of complexity is the VC-subgraph dimension.

Alternatively, even for some non-parametric hypothesis classes (i.e. those with infinite VC-subgraph dimension),
e.g. the class of low-norm linear predictors

$$\mathcal{H}_B = \{h_w : x \mapsto \langle w, x\rangle \mid \|w\| \le B\},$$

guarantees can be obtained in terms of scale-sensitive measures of complexity such as fat-shattering
dimensions [1], covering numbers [24] or Rademacher complexity [?]. The classical statistical learning theory
approach for obtaining learning guarantees for such scale-sensitive classes is to rely on the Lipschitz constant
$D$ of $\phi(t, y)$ w.r.t. $t$ (i.e. a bound on its derivative w.r.t. $t$). The excess risk can then be bounded as (in
expectation over the sample):

$$L(\hat{h}) \le L^* + 2D\mathcal{R}_n(\mathcal{H}) = L^* + 2\sqrt{\frac{D^2R}{n}} \qquad (1)$$

where $\hat{h} = \arg\min_{h \in \mathcal{H}}\hat{L}(h)$ is the empirical risk minimizer (ERM), $L^* = \inf_{h \in \mathcal{H}}L(h)$ is the approximation error,
and $\mathcal{R}_n(\mathcal{H})$ is the Rademacher complexity of the class, which typically scales as $\mathcal{R}_n(\mathcal{H}) = \sqrt{R/n}$. E.g. for
$\ell_2$-bounded linear predictors, $R = B^2\sup\|X\|_2^2$. The Rademacher complexity can be bounded by other
scale-sensitive complexity measures, such as the fat-shattering dimensions and covering numbers, yielding
similar guarantees in terms of these measures.

In this paper we address two deficiencies of the guarantee (1). First, the bound applies only to loss functions
with bounded derivative, like the hinge loss and logistic loss popular for classification, or the absolute-value
($\ell_1$) loss for regression. It is not directly applicable to the squared loss $\phi(t, y) = \frac{1}{2}(t - y)^2$, for which the
second derivative is bounded, but not the first. We could try to simply bound the derivative of the squared
loss in terms of a bound on the magnitude of $h(x)$, but e.g. for norm-bounded linear predictors $\mathcal{H}_B$ this
results in a very disappointing excess risk bound of the form $O\left(\sqrt{B^4(\max\|X\|)^4/n}\right)$. One aim of this paper is
to provide clean bounds on the excess risk for smooth loss functions such as the squared loss with a bounded
second, rather than first, derivative.

The second deficiency of (1) is the dependence on $1/\sqrt{n}$. The dependence on $1/\sqrt{n}$ might be unavoidable in
general. But at least for finite dimensional (parametric) classes, we know it can be improved to a $1/n$ rate
when the distribution is separable, i.e. when there exists $h \in \mathcal{H}$ with $L(h) = 0$ and so $L^* = 0$. In particular,
if $\mathcal{H}$ is a class of bounded functions with VC-subgraph-dimension $d$ (e.g. $d$-dimensional linear predictors),
then (in expectation over the sample) [23]:

$$L(\hat{h}) \le L^* + O\left(\frac{dD\log n}{n} + \sqrt{\frac{dDL^*\log n}{n}}\right) \qquad (2)$$

The $1/\sqrt{n}$ term disappears in the separable case, and we get a graceful degradation between the $1/\sqrt{n}$
non-separable rate and the $1/n$ separable rate. Could we get a $1/n$ separable rate, and such a graceful
degradation, also in the non-parametric case?

As we will show, the two deficiencies are actually related. For non-parametric classes, and non-smooth
Lipschitz loss, such as the hinge-loss, the excess risk might scale as $1/\sqrt{n}$ and not $1/n$, even in the separable
case. However, for $H$-smooth non-negative loss functions, where the second derivative of $\phi(t, y)$ w.r.t. $t$ is
bounded by $H$, a $1/n$ separable rate is possible. In Section 2 we obtain the following bound on the excess
risk (up to logarithmic factors):

$$L(\hat{h}) \le L^* + \tilde{O}\left(H\mathcal{R}_n^2(\mathcal{H}) + \sqrt{HL^*}\,\mathcal{R}_n(\mathcal{H})\right) = L^* + \tilde{O}\left(\frac{HR}{n} + \sqrt{\frac{HRL^*}{n}}\right) \le 2L^* + \tilde{O}\left(\frac{HR}{n}\right). \qquad (3)$$

In particular, for $\ell_2$-norm-bounded linear predictors $\mathcal{H}_B$ with $\sup\|X\|_2 \le 1$, the excess risk is bounded by
$\tilde{O}\left(HB^2/n + \sqrt{HB^2L^*/n}\right)$. Another interesting distinction between parametric and non-parametric classes
is that even for the squared loss, the bound (3) is tight and the non-separable rate of $1/\sqrt{n}$ is unavoidable.
This is in contrast to the parametric (finite dimensional) case, where a rate of $1/n$ is always possible for
the squared loss, regardless of the approximation error $L^*$ [16]. The differences between parametric and
scale-sensitive classes, and between non-smooth, smooth and strongly convex (e.g. squared) loss functions
are discussed in Section 4 and summarized in Table 1.
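
To make the contrast between the Lipschitz rate in (1) and the smooth rate in (3) concrete, the following minimal sketch evaluates both expressions with all constants and logarithmic factors dropped. The function names and the parameter values are illustrative assumptions and are not part of the analysis.

```python
import math


def lipschitz_rate(D, R, n):
    """Shape of the Lipschitz bound: sqrt(D^2 R / n), constants dropped."""
    return math.sqrt(D * D * R / n)


def smooth_rate(H, R, n, L_star):
    """Shape of the smooth bound: HR/n + sqrt(HR L*/n), constants and logs dropped."""
    return H * R / n + math.sqrt(H * R * L_star / n)


# Illustrative values only: D = H = R = 1, separable (L* = 0) and noisy (L* = 0.1).
for n in (100, 10_000, 1_000_000):
    print(n,
          lipschitz_rate(1.0, 1.0, n),
          smooth_rate(1.0, 1.0, n, 0.0),
          smooth_rate(1.0, 1.0, n, 0.1))
```

In the separable case the smooth expression decays like $1/n$, while for $L^* > 0$ the $\sqrt{L^*/n}$ term eventually dominates, which is the graceful degradation described above.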

The guarantees discussed thus far are general learning guarantees for the stochastic setting that rely only on
the Rademacher complexity of the hypothesis class, and are phrased in terms of minimizing some scalar loss
function. In Section 3 we consider also the online setting, in addition to the stochastic setting, and present
similar guarantees for online and stochastic convex optimization [34, 25]. The guarantees of Section 3 match
equation (3) for the special case of a convex loss function and norm-bounded linear predictors, but Section 3
captures a more general setting of optimizing an arbitrary non-negative convex objective, which we require to
be smooth (there is no separate discussion of a "predictor" and a scalar loss function in Section 3). Results
in Section 3 are expressed in terms of properties of the norm, rather than a measure of concentration like
the Rademacher complexity as in (3) and Section 2. However, the online and stochastic convex optimization
setting of Section 3 is also more restrictive, as we require the objective be convex (in Section 2 and for the
bound (3) we make no assumption about the convexity of the hypothesis class $\mathcal{H}$ nor the loss function $\phi$).

Specifically, for a non-negative $H$-smooth convex objective (see exact definition in Section 3), over a domain
bounded by $B$, we prove that the average online regret (and so also the excess risk of stochastic optimization)
is bounded by $O\left(HB^2/n + \sqrt{HB^2L^*/n}\right)$. Comparing with the bound of $O\left(\sqrt{D^2B^2/n}\right)$ when the loss is
$D$-Lipschitz rather than $H$-smooth [3], [22], we see the same relationship discussed above for ERM. Unlike
the bound (3) for the ERM, the convex optimization bound avoids polylogarithmic factors. The results in
Section 3 also generalize to smoothness and boundedness with respect to non-Euclidean norms.

Studying the online and stochastic convex optimization setting (Section 3), in addition to ERM (Section 2),
has several advantages. First, it allows us to obtain a learning guarantee for efficient single-pass learning
methods, namely stochastic gradient descent (or mirror descent), as well as for the non-stochastic regret.
Second, the bound we obtain in the convex optimization setting (Section 3) is actually better than the bound
for the ERM (Section 2) as it avoids all polylogarithmic and large constant factors. Third, the bound is
applicable to other non-negative online or stochastic optimization problems beyond classification, including
problems for which ERM is not applicable (see, e.g., [25]).



2 Empirical Risk Minimization with Smooth Loss 

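Recall that the Rademacher complexity of $\mathcal{H}$ for any $n \in \mathbb{N}$ is given by:

$$\mathcal{R}_n(\mathcal{H}) = \sup_{x_1, \ldots, x_n \in \mathcal{X}}\mathbb{E}_{\sigma \sim \mathrm{Unif}(\{\pm 1\}^n)}\left[\sup_{h \in \mathcal{H}}\frac{1}{n}\left|\sum_{i=1}^{n}h(x_i)\sigma_i\right|\right] \qquad (4)$$

Throughout we shall consider the "worst case" Rademacher complexity.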
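
As an illustration of the quantity being defined, the sketch below estimates the Rademacher average of norm-bounded linear predictors on one fixed sample by Monte Carlo. It relies on the standard fact that for $\{x \mapsto \langle w, x\rangle : \|w\|_2 \le B\}$ the inner supremum equals $\frac{B}{n}\|\sum_i \sigma_i x_i\|_2$. Note the assumption: this is the complexity on a given sample, not the worst case over samples used in the text, and the function name and arguments are hypothetical.

```python
import math
import random


def empirical_rademacher_linear(xs, B, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_{||w||<=B} (1/n)|sum_i sigma_i <w, x_i>|.

    For the l2 ball the supremum has the closed form (B/n) * ||sum_i sigma_i x_i||_2.
    """
    rng = random.Random(seed)
    n, dim = len(xs), len(xs[0])
    total = 0.0
    for _ in range(num_draws):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        # Signed sum of the sample points, coordinate by coordinate.
        s = [sum(sigma[i] * xs[i][k] for i in range(n)) for k in range(dim)]
        total += B * math.sqrt(sum(v * v for v in s)) / n
    return total / num_draws


# Two orthonormal points: the signed sum always has norm sqrt(2).
print(empirical_rademacher_linear([[1.0, 0.0], [0.0, 1.0]], B=1.0, num_draws=50))
```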

Our starting point is the learning bound (1) that applies to $D$-Lipschitz loss functions, i.e. such that
$|\phi'(t, y)| \le D$ (we always take derivatives w.r.t. the first argument). What type of bound can we obtain if
we instead bound the second derivative $\phi''(t, y)$? We will actually avoid talking about the second derivative
explicitly, and instead say that a function is $H$-smooth iff its derivative is $H$-Lipschitz. For twice differentiable
$\phi$, this just means that $|\phi''| \le H$. The central observation, which allows us to obtain guarantees for smooth
loss functions, is that for a smooth loss, the derivative can be bounded in terms of the function value:

Lemma 2.1. For an $H$-smooth non-negative function $f : \mathbb{R} \to \mathbb{R}$, we have: $|f'(t)| \le \sqrt{4Hf(t)}$

Proof. For any $t, r$, by the mean value theorem there is some $s$ between $t$ and $r$ for which $f(r) = f(t) + f'(s)(r - t)$. Now:

$$0 \le f(r) = f(t) + f'(t)(r - t) + (f'(s) - f'(t))(r - t)$$
$$\le f(t) + f'(t)(r - t) + H|s - t||r - t| \le f(t) + f'(t)(r - t) + H(r - t)^2$$

Setting $r - t = -\frac{f'(t)}{2H}$ yields the desired bound. □
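
The self-bounding property $|f'(t)| \le \sqrt{4Hf(t)}$ can be checked numerically on familiar losses. The sketch below does so for the squared loss $(t - 1)^2$ (with $H = 2$) and the logistic loss $\log(1 + e^{-t})$ (with $H = 1/4$); the helper name, the two example losses and the grid of test points are choices made for illustration only.

```python
import math


def self_bounding_gap(f, f_prime, H, ts):
    """Smallest value of sqrt(4*H*f(t)) - |f'(t)| over ts; non-negative if the bound holds."""
    return min(math.sqrt(4.0 * H * f(t)) - abs(f_prime(t)) for t in ts)


def squared(t):
    return (t - 1.0) ** 2            # second derivative is 2, so H = 2


def squared_prime(t):
    return 2.0 * (t - 1.0)


def logistic(t):
    return math.log1p(math.exp(-t))  # second derivative is at most 1/4, so H = 1/4


def logistic_prime(t):
    return -1.0 / (1.0 + math.exp(t))


grid = [k / 10.0 for k in range(-100, 101)]
print(self_bounding_gap(squared, squared_prime, 2.0, grid))
print(self_bounding_gap(logistic, logistic_prime, 0.25, grid))
```

Both losses are non-negative, which the proof uses in its first step; a loss that can take negative values need not satisfy the inequality.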



This Lemma allows us to argue that close to the optimum value, where the value of the loss is small,
so is its derivative. Looking at the dependence of (1) on the derivative bound $D$, we are guided by
the following heuristic intuition: Since we should be concerned only with the behavior around the ERM,
perhaps it is enough to bound $\phi'(\hat{w}, x)$ at the ERM $\hat{w}$. Applying Lemma 2.1 to $L(\hat{h})$, we can bound
$|\mathbb{E}[\phi'(\hat{w}, X)]| \le \sqrt{4HL(\hat{h})}$. What we would actually want is to bound each $|\phi'(\hat{w}, x)|$ separately, or at
least have the absolute value inside the expectation; this is where the non-negativity of the loss plays an
important role. Ignoring this important issue for the moment and plugging this instead of $D$ into (1) yields
$L(\hat{h}) \le L^* + 4\sqrt{HL(\hat{h})}\,\mathcal{R}_n(\mathcal{H})$. Solving for $L(\hat{h})$ yields the desired bound (3).
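
The step of "solving for $L(\hat{h})$" amounts to solving a quadratic in $\sqrt{L}$. As a hedged sketch of that algebra only (the function names are hypothetical, and this is the heuristic inequality, not the theorem), the code below computes the largest $L$ satisfying $L \le L^* + 4\sqrt{HL}\,\mathcal{R}_n$ and compares it with the relaxed form $L^* + 16H\mathcal{R}_n^2 + 4\sqrt{HL^*}\,\mathcal{R}_n$.

```python
import math


def solve_self_bounded(L_star, H, R_n):
    """Largest L >= 0 with L <= L_star + 4*sqrt(H*L)*R_n.

    With x = sqrt(L) and a = 2*sqrt(H)*R_n the boundary is x^2 - 2*a*x - L_star = 0,
    whose positive root is x = a + sqrt(a^2 + L_star).
    """
    a = 2.0 * math.sqrt(H) * R_n
    root = a + math.sqrt(a * a + L_star)
    return root * root


def relaxed_upper_bound(L_star, H, R_n):
    """Looser closed form obtained from sqrt(a^2 + L_star) <= a + sqrt(L_star)."""
    return L_star + 16.0 * H * R_n ** 2 + 4.0 * math.sqrt(H * L_star) * R_n


print(solve_self_bounded(0.0, 1.0, 0.1), relaxed_upper_bound(0.0, 1.0, 0.1))
print(solve_self_bounded(0.5, 2.0, 0.05), relaxed_upper_bound(0.5, 2.0, 0.05))
```

The relaxed form has exactly the shape $L^* + O(H\mathcal{R}_n^2 + \sqrt{HL^*}\,\mathcal{R}_n)$ that appears in the excess risk bound.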
This rough intuition is captured by the following Theorem: 

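Theorem 1. For an $H$-smooth non-negative loss $\phi$ s.t. $\forall x, y, h$: $|\phi(h(x), y)| \le b$, for any $\delta > 0$ we have that
with probability at least $1 - \delta$ over a random sample of size $n$, for any $h \in \mathcal{H}$,

$$L(h) \le \hat{L}(h) + K\left(\sqrt{\hat{L}(h)}\left(\sqrt{H}\log^{1.5}n\,\mathcal{R}_n(\mathcal{H}) + \sqrt{\frac{b\log(1/\delta)}{n}}\right) + H\log^3n\,\mathcal{R}_n^2(\mathcal{H}) + \frac{b\log(1/\delta)}{n}\right)$$

and so:

$$L(\hat{h}) \le L^* + K\left(\sqrt{L^*}\left(\sqrt{H}\log^{1.5}n\,\mathcal{R}_n(\mathcal{H}) + \sqrt{\frac{b\log(1/\delta)}{n}}\right) + H\log^3n\,\mathcal{R}_n^2(\mathcal{H}) + \frac{b\log(1/\delta)}{n}\right)$$

where $K < 10^5$ is a numeric constant derived from [?] and [?].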

Note that only the "confidence" terms depend on $b = \sup|\phi|$, and this is typically not the dominant
term. We believe it is possible to also obtain a bound that holds in expectation over the sample (rather than
with high probability) and that avoids a direct dependence on $\sup|\phi|$.

To prove Theorem 1 we use the notion of Local Rademacher Complexity which allows us to focus on the
behavior close to the ERM. To this end, consider the following empirically restricted loss class

$$\mathcal{L}_\phi(r) := \left\{(x, y) \mapsto \phi(h(x), y) : h \in \mathcal{H}, \hat{L}(h) \le r\right\}$$



Lemma 2.2, presented below, solidifies the heuristic intuition discussed above, by showing that the Rademacher
complexity of $\mathcal{L}_\phi(r)$ scales with $\sqrt{Hr}$. The Lemma can be seen as a higher-order version of the Lipschitz
Composition Lemma [2], which states that the Rademacher complexity of the unrestricted loss class is bounded
by $D\mathcal{R}_n(\mathcal{H})$. Here, we use the second, rather than first, derivative, and obtain a bound that depends on the
empirical restriction:

Lemma 2.2. For a non-negative $H$-smooth loss $\phi$ bounded by $b$ and any function class $\mathcal{H}$ bounded by $B$:

$$\mathcal{R}_n(\mathcal{L}_\phi(r)) \le \sqrt{12Hr}\,\mathcal{R}_n(\mathcal{H})\left(16\log^{3/2}\left(\frac{nB}{\mathcal{R}_n(\mathcal{H})}\right) - 14\log^{3/2}\left(\frac{n\sqrt{12H}B}{\sqrt{b}}\right)\right)$$

Proof. In order to prove Lemma 2.2 we actually move from Rademacher complexity to covering numbers,
use smoothness and Lemma 2.1 to obtain an $r$-dependent cover of the empirically restricted class, and then
return to the Rademacher complexity. More specifically:

• We use a modified version of Dudley's integral to bound the Rademacher complexity of the empirically
restricted class in terms of its $L_2$-covering numbers.

• We use smoothness to get an $r$-dependent bound on the $L_2$-covering numbers of the empirically
restricted loss class in terms of $L_\infty$-covering numbers of the unrestricted hypothesis class.

• We bound the $L_\infty$-covering numbers of the unrestricted class in terms of its fat-shattering dimension,
which in turn can be bounded in terms of its Rademacher complexity.

Before we proceed, recall the following definitions of covering numbers and fat shattering dimension. For
any $\epsilon > 0$ and function class $\mathcal{F} \subset \mathbb{R}^{\mathcal{Z}}$:

The $L_2$ covering number $\mathcal{N}_2(\mathcal{F}, \epsilon, n)$ is the supremum over samples $z_1, \ldots, z_n$ of the size of a minimal cover
$\mathcal{C}_\epsilon$ such that $\forall f \in \mathcal{F}$, $\exists f_\epsilon \in \mathcal{C}_\epsilon$ s.t. $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(f(z_i) - f_\epsilon(z_i))^2} \le \epsilon$.

The $L_\infty$ covering number $\mathcal{N}_\infty(\mathcal{F}, \epsilon, n)$ is the supremum over samples $z_1, \ldots, z_n$ of the size of a minimal
cover $\mathcal{C}_\epsilon$ such that $\forall f \in \mathcal{F}$, $\exists f_\epsilon \in \mathcal{C}_\epsilon$ s.t. $\max_{i \in [n]}|f(z_i) - f_\epsilon(z_i)| \le \epsilon$.

The fat-shattering dimension $\mathrm{fat}_\epsilon(\mathcal{F})$ at scale $\epsilon$ is the maximum number of points $\epsilon$-shattered by $\mathcal{F}$ (see
e.g. [?]).

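Bounding $\mathcal{R}_n(\mathcal{L}_\phi(r))$ in terms of $\mathcal{N}_2(\mathcal{L}_\phi(r))$. Dudley's integral bound lets us bound the Rademacher
complexity of a class in terms of its empirical $L_2$ covering number. Here we use a more refined version of
Dudley's integral bound due to Mendelson [21] and more explicitly stated in [?] and included for completeness
as Lemma A.3 in the Appendix:

$$\mathcal{R}_n(\mathcal{L}_\phi(r)) \le \inf_{\alpha > 0}\left\{4\alpha + 10\int_\alpha^{\sqrt{br}}\sqrt{\frac{\log\mathcal{N}_2(\mathcal{L}_\phi(r), \epsilon, n)}{n}}\,d\epsilon\right\} \qquad (5)$$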



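Bounding $\mathcal{N}_2(\mathcal{L}_\phi(r))$ in terms of $\mathcal{N}_\infty(\mathcal{H})$. In the Appendix we show that a corollary of Lemma 2.1 is
that for a non-negative $H$-smooth $f(\cdot)$ we have $(f(t) - f(r))^2 \le 6H(f(t) + f(r))(t - r)^2$ (Lemma A.1). Using
this inequality, for any sample $(x_1, y_1), \ldots, (x_n, y_n)$:

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\phi(h(x_i), y_i) - \phi(h_\epsilon(x_i), y_i)\right)^2} \le \sqrt{\frac{6H}{n}\sum_{i=1}^{n}\left(\phi(h(x_i), y_i) + \phi(h_\epsilon(x_i), y_i)\right)\left(h(x_i) - h_\epsilon(x_i)\right)^2}$$
$$\le \sqrt{\frac{6H}{n}\sum_{i=1}^{n}\left(\phi(h(x_i), y_i) + \phi(h_\epsilon(x_i), y_i)\right)}\;\max_{i \in [n]}|h(x_i) - h_\epsilon(x_i)|$$
$$\le \sqrt{12Hr}\,\max_{i \in [n]}|h(x_i) - h_\epsilon(x_i)|$$

That is, an empirical $L_\infty$ cover of $\{h \in \mathcal{H} : \hat{L}(h) \le r\}$ at radius $\epsilon/\sqrt{12Hr}$ is also an empirical $L_2$ cover of
$\mathcal{L}_\phi(r)$ at radius $\epsilon$, and we can conclude that:

$$\mathcal{N}_2(\mathcal{L}_\phi(r), \epsilon, n) \le \mathcal{N}_\infty\left(\left\{h \in \mathcal{H} : \hat{L}(h) \le r\right\}, \frac{\epsilon}{\sqrt{12Hr}}, n\right) \le \mathcal{N}_\infty\left(\mathcal{H}, \frac{\epsilon}{\sqrt{12Hr}}, n\right) \qquad (6)$$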



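Bounding $\mathcal{N}_\infty(\mathcal{H})$ in terms of $\mathcal{R}_n(\mathcal{H})$. The covering number at scale $\epsilon/\sqrt{12Hr}$ can be bounded in
terms of the fat shattering dimension at that scale as [21]:

$$\log\mathcal{N}_\infty\left(\mathcal{H}, \frac{\epsilon}{\sqrt{12Hr}}, n\right) \le \mathrm{fat}_{\frac{\epsilon}{\sqrt{12Hr}}}(\mathcal{H})\,\log\left(\frac{n\sqrt{12Hr}B}{\epsilon}\right) \qquad (7)$$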



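Hence by Equation (5) we have:

$$\mathcal{R}_n(\mathcal{L}_\phi(r)) \le \inf_{\alpha > 0}\left\{4\alpha + 10\int_\alpha^{\sqrt{br}}\sqrt{\frac{\mathrm{fat}_{\frac{\epsilon}{\sqrt{12Hr}}}(\mathcal{H})\log\left(\frac{n\sqrt{12Hr}B}{\epsilon}\right)}{n}}\,d\epsilon\right\} \qquad (8)$$

choosing $\alpha = \sqrt{12Hr}\,\mathcal{R}_n(\mathcal{H})$:

$$\le 4\sqrt{12Hr}\,\mathcal{R}_n(\mathcal{H}) + 10\int_{\sqrt{12Hr}\,\mathcal{R}_n(\mathcal{H})}^{\sqrt{br}}\sqrt{\frac{\mathrm{fat}_{\frac{\epsilon}{\sqrt{12Hr}}}(\mathcal{H})\log\left(\frac{n\sqrt{12Hr}B}{\epsilon}\right)}{n}}\,d\epsilon \qquad (9)$$

after a change of integration variable:

$$\le 4\sqrt{12Hr}\,\mathcal{R}_n(\mathcal{H}) + 10\sqrt{12Hr}\int_{\mathcal{R}_n(\mathcal{H})}^{\sqrt{b/12H}}\sqrt{\frac{\mathrm{fat}_\epsilon(\mathcal{H})\log\left(\frac{nB}{\epsilon}\right)}{n}}\,d\epsilon$$

bounding the fat-shattering dimension in terms of the Rademacher complexity (Lemma A.2):

$$\le 4\sqrt{12Hr}\,\mathcal{R}_n(\mathcal{H}) + 20\sqrt{12Hr}\,\mathcal{R}_n(\mathcal{H})\int_{\mathcal{R}_n(\mathcal{H})}^{\sqrt{b/12H}}\frac{\sqrt{\log\left(\frac{nB}{\epsilon}\right)}}{\epsilon}\,d\epsilon$$
$$\le \sqrt{12Hr}\,\mathcal{R}_n(\mathcal{H})\left(4 + 20\left[-\tfrac{2}{3}\log^{3/2}\left(\frac{nB}{\epsilon}\right)\right]_{\mathcal{R}_n(\mathcal{H})}^{\sqrt{b/12H}}\right)$$
$$\le \sqrt{12Hr}\,\mathcal{R}_n(\mathcal{H})\left(4 + 14\left(\log^{3/2}\left(\frac{nB}{\mathcal{R}_n(\mathcal{H})}\right) - \log^{3/2}\left(\frac{nB\sqrt{12H}}{\sqrt{b}}\right)\right)\right) \qquad (10)$$
$$\le \sqrt{12Hr}\,\mathcal{R}_n(\mathcal{H})\left(16\log^{3/2}\left(\frac{nB}{\mathcal{R}_n(\mathcal{H})}\right) - 14\log^{3/2}\left(\frac{n\sqrt{12H}B}{\sqrt{b}}\right)\right) \qquad (11)$$

□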



Proof of Theorem 1. Equipped with Lemma 2.2, the proof follows standard Local Rademacher arguments.
Applying Theorem 6.1 of [?] to $\psi_n(r) = 56\sqrt{Hr}\log^{1.5}n\,\mathcal{R}_n(\mathcal{H})$ we can show that:

$$L(h) \le \hat{L}(h) + 106\,r_n^* + \frac{48b}{n}\left(\log\frac{1}{\delta} + \log\log n\right) + \sqrt{\hat{L}(h)\left(8r_n^* + \frac{4b}{n}\left(\log\frac{1}{\delta} + \log\log n\right)\right)}$$

where $r_n^* = 56^2H\log^3n\,\mathcal{R}_n^2(\mathcal{H})$ is the solution to $\psi_n(r) = r$. Further details can be found in the Appendix. □



2.1 Related Results 



Rates faster than $1/\sqrt{n}$ have been previously explored under various conditions, including when $L^*$ is small.



The Finite Dimensional Case. Lee et al [16] showed faster rates for squared loss, exploiting the strong
convexity of this loss function, even when $L^* > 0$, but only with finite VC-subgraph-dimension. Panchenko
[23] provides fast rate results for general Lipschitz bounded loss functions, still in the finite VC-subgraph-dimension
case. Bousquet [?] provided similar guarantees for linear predictors in Hilbert spaces when the
spectrum of the kernel matrix (covariance of $X$) is exponentially decaying, making the situation almost
finite dimensional. All these methods rely on finiteness of effective dimension to provide fast rates. In this
case, smoothness is not necessary. Our method, on the other hand, establishes fast rates, when $L^* = 0$, for
function classes that do not have finite VC-subgraph-dimension. We show how in this non-parametric case,
smoothness is necessary and plays an important role (see also Table 1).



Aggregation. Tsybakov [31] studied learning rates for aggregation, where a predictor is chosen from the
convex hull of a finite set of base predictors. This is equivalent to an $\ell_1$ constraint where each base predictor
is viewed as a "feature". As with $\ell_1$-based analysis, since the bounds depend only logarithmically on the
number of base predictors (i.e. dimensionality), and rely on the scale of change of the loss function, they are
of "scale sensitive" nature. For such an aggregate classifier, Tsybakov obtained a rate of $1/n$ when zero (or
small) risk is achieved by one of the base classifiers. Using Tsybakov's result, it is not enough for zero risk to
be achieved by an aggregate (i.e. bounded $\ell_1$) classifier in order to obtain the faster rate. Tsybakov's core
result is thus in a sense more similar to the finite dimensional results, since it allows for a rate of $1/n$ when
zero error is achieved by a finite cardinality (and hence finite dimension) class.

Tsybakov then used the approximation error of a small class of base predictors w.r.t. a large hypothesis class
(i.e. a covering) to obtain learning rates for the large hypothesis class by considering aggregation within
the small class. However these results only imply fast learning rates for hypothesis classes with very low
complexity. Specifically, to get learning rates better than $1/\sqrt{n}$ using these results, the covering number
of the hypothesis class at scale $\epsilon$ needs to behave as $1/\epsilon^p$ for some $p < 2$. But typical classes, including
the class of linear predictors with bounded norm, have covering numbers that scale as $1/\epsilon^2$ and so these
methods do not imply fast rates for such function classes. In fact, to get rates of $1/n$ with these techniques,
even when $L^* = 0$, requires covering numbers that do not grow as $\epsilon$ decreases at all, and so actually finite
VC-subgraph-dimension.

Chesneau et al extend Tsybakov's work also to general losses, deriving similar results for Lipschitz loss
functions. The same caveats hold: even when $L^* = 0$, rates faster than $1/\sqrt{n}$ require covering numbers that
grow slower than $1/\epsilon^2$, and rates of $1/n$ essentially require finite VC-subgraph-dimension. Our work, on
the other hand, is applicable whenever the Rademacher complexity (equivalently covering numbers) can be
controlled. Although it uses some similar techniques, it is also rather different from the work of Tsybakov
and Chesneau et al, in that it points out the importance of smoothness for obtaining fast rates in the
non-parametric case: Chesneau et al relied only on the Lipschitz constant, which we show, in Section 4, is not
enough for obtaining fast rates in the non-parametric case, even when $L^* = 0$.

Local Rademacher Complexities. Bartlett et al [?] developed a general machinery for proving possible
fast rates based on local Rademacher complexities. However, it is important to note that the localized
complexity term typically dominates the rate and still needs to be controlled. For example, Steinwart [29]
used Local Rademacher Complexity to provide fast rates on the 0/1 loss of Support Vector Machines (SVMs)
($\ell_2$-regularized hinge-loss minimization) based on the so called "geometric margin condition" and Tsybakov's
margin condition. Steinwart's analysis is specific to SVMs. We also use Local Rademacher Complexities in
order to obtain fast rates, but do so for general hypothesis classes, based only on the standard Rademacher
complexity $\mathcal{R}_n(\mathcal{H})$ of the hypothesis class, as well as the smoothness of the loss function and the magnitude
of $L^*$, but without any further assumptions on the hypothesis class itself.

Non-Lipschitz Loss. Beyond the strong connections between smoothness and fast rates which we highlight,
we are also not aware of prior work providing an explicit and easy-to-use result for controlling a generic non- 
Lipschitz loss (such as the squared loss) solely in terms of the Rademacher complexity. 

3 Online and Stochastic Optimization of Smooth Convex Objectives

We now turn to online and stochastic convex optimization. In these settings a learner chooses $w \in \mathcal{W}$, where
$\mathcal{W}$ is a closed convex set in a normed vector space, attempting to minimize an objective (loss) $\ell(w, z)$ on
instances $z \in \mathcal{Z}$, where $\ell : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ is an objective function which is convex in $w$. This captures learning
linear predictors w.r.t. a convex loss function $\phi(t, y)$, where $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ and $\ell(w, (x, y)) = \phi(\langle w, x\rangle, y)$, and
extends well beyond supervised learning.

We consider the case where the objective $\ell(w, z)$ is $H$-smooth w.r.t. some norm $\|w\|$ (the reader may choose
to think of $\mathcal{W}$ as a subset of a Euclidean or Hilbert space, and $\|w\|$ as the $\ell_2$-norm): By this we mean that
for any $z \in \mathcal{Z}$, and all $w, w' \in \mathcal{W}$

$$\|\nabla\ell(w, z) - \nabla\ell(w', z)\|_* \le H\|w - w'\|$$

where $\|\cdot\|_*$ is the dual norm. The key here is to generalize Lemma 2.1 to smoothness w.r.t. a vector $w$,
rather than scalar smoothness:

Lemma 3.1. For an $H$-smooth non-negative $f : \mathcal{W} \to \mathbb{R}$, for all $w \in \mathcal{W}$: $\|\nabla f(w)\|_* \le \sqrt{4Hf(w)}$

Proof. For any $w_0$ such that $\|w - w_0\| \le 1$, let $g(t) = f(w_0 + t(w - w_0))$. For any $t, s \in \mathbb{R}$,

$$|g'(t) - g'(s)| = \left|\langle\nabla f(w_0 + t(w - w_0)) - \nabla f(w_0 + s(w - w_0)), w - w_0\rangle\right|$$
$$\le \|\nabla f(w_0 + t(w - w_0)) - \nabla f(w_0 + s(w - w_0))\|_*\,\|w - w_0\|$$
$$\le H|t - s|\,\|w - w_0\|^2$$
$$\le H|t - s|$$

Hence $g$ is $H$-smooth and so by Lemma 2.1 $|g'(t)| \le \sqrt{4Hg(t)}$. Setting $t = 1$ we have $\langle\nabla f(w), w - w_0\rangle \le \sqrt{4Hf(w)}$.
Taking supremum over $w_0$ such that $\|w_0 - w\| \le 1$ we conclude that

$$\|\nabla f(w)\|_* = \sup_{w_0 : \|w - w_0\| \le 1}\langle\nabla f(w), w - w_0\rangle \le \sqrt{4Hf(w)}$$

□
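
For a concrete instance of the vector-valued self-bounding property, consider the single-example squared objective $f(w) = \frac{1}{2}(\langle w, x\rangle - y)^2$ under the Euclidean norm, for which $H = \|x\|_2^2$. The sketch below (the function name and the random test points are illustrative assumptions) compares $\|\nabla f(w)\|_2$ with $\sqrt{4Hf(w)}$.

```python
import math
import random


def grad_norm_and_bound(w, x, y):
    """Return (||grad f(w)||_2, sqrt(4*H*f(w))) for f(w) = 0.5*(<w,x> - y)^2, H = ||x||_2^2."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    x_norm = math.sqrt(sum(xi * xi for xi in x))
    grad_norm = abs(residual) * x_norm       # grad f(w) = residual * x
    f_value = 0.5 * residual * residual
    H = x_norm * x_norm                      # Hessian is x x^T, with spectral norm ||x||^2
    return grad_norm, math.sqrt(4.0 * H * f_value)


rng = random.Random(0)
for _ in range(5):
    w = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    x = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    print(grad_norm_and_bound(w, x, rng.uniform(-1.0, 1.0)))
```

For this particular objective the two quantities differ by exactly a factor of $\sqrt{2}$, so the inequality holds with room to spare.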

In order to consider general norms, we will also need to rely on a non-negative regularizer $F : \mathcal{W} \to \mathbb{R}$ that
is 1-strongly convex (see Definition in e.g. [33]) w.r.t. the norm $\|w\|$ for all $w \in \mathcal{W}$. For the Euclidean
norm we can use the squared Euclidean norm regularizer: $F(w) = \frac{1}{2}\|w\|^2$.

3.1 Online Optimization Setting 

In the online convex optimization setting we consider an $n$ round game played between a learner and an
adversary (Nature) where at each round $i$, the player chooses a $w_i \in \mathcal{W}$ and then the adversary picks a
$z_i \in \mathcal{Z}$. The player's choice $w_i$ may only depend on the adversary's choices in previous rounds. The goal of
the player is to have low average objective value $\frac{1}{n}\sum_{i=1}^{n}\ell(w_i, z_i)$ compared to the best single choice in
hindsight [?].

A classic algorithm for this setting is Mirror Descent [?], which starts at some arbitrary $w_1 \in \mathcal{W}$ and updates
$w_{i+1}$ according to $z_i$ and a stepsize $\eta$ (to be discussed later) as follows:

$$w_{i+1} \leftarrow \arg\min_{w \in \mathcal{W}}\;\langle\eta\nabla\ell(w_i, z_i) - \nabla F(w_i), w\rangle + F(w) \qquad (12)$$

For the Euclidean norm with $F(w) = \frac{1}{2}\|w\|^2$, the update (12) becomes projected online gradient descent

$$w_{i+1} \leftarrow \Pi_{\mathcal{W}}\left(w_i - \eta\nabla\ell(w_i, z_i)\right) \qquad (13)$$

where $\Pi_{\mathcal{W}}(w) = \arg\min_{w' \in \mathcal{W}}\|w - w'\|$ is the projection onto $\mathcal{W}$.
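
The projected gradient update can be sketched in a few lines once a concrete domain is fixed. The code below assumes, purely for illustration, that $\mathcal{W}$ is a Euclidean ball of a given radius centered at the origin (the text allows any closed convex set); the function names are hypothetical.

```python
import math


def project_onto_ball(w, radius):
    """Euclidean projection onto {w : ||w||_2 <= radius}: rescale only if w lies outside."""
    norm = math.sqrt(sum(v * v for v in w))
    if norm <= radius:
        return list(w)
    return [v * radius / norm for v in w]


def projected_gradient_step(w, grad, eta, radius):
    """One projected online gradient descent update: step along -grad, then project back."""
    return project_onto_ball([wi - eta * gi for wi, gi in zip(w, grad)], radius)


print(projected_gradient_step([0.0, 0.0], [-3.0, -4.0], 1.0, 1.0))
```

For other domains only the projection routine changes; the gradient step itself is the same.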

Theorem 2. For any $B \in \mathbb{R}$ and $L^*$ if we use stepsize $\eta = \frac{1}{HB^2 + \sqrt{H^2B^4 + HB^2nL^*}}$ for the Mirror Descent
algorithm then for any instance sequence $z_1, \ldots, z_n \in \mathcal{Z}$, the average regret w.r.t. any $w^* \in \mathcal{W}$ s.t. $F(w^*) \le B^2$
and $\frac{1}{n}\sum_{i=1}^{n}\ell(w^*, z_i) \le L^*$ is bounded by:

$$\frac{1}{n}\sum_{i=1}^{n}\ell(w_i, z_i) - \frac{1}{n}\sum_{i=1}^{n}\ell(w^*, z_i) \le \frac{4HB^2}{n} + 2\sqrt{\frac{HB^2L^*}{n}}$$

Note that the stepsize depends on the bound L* on the loss in hindsight. 

Proof. The proof follows from Lemma 3.1 and Theorem 1 of [28], using $U_1 = B^2$ and $U_2 = nL^*$ in the
Theorem. □
3.2 Stochastic Optimization I: Stochastic Mirror Descent



An online algorithm can also serve as an efficient one-pass learning algorithm in the stochastic setting. Here,
we again consider an i.i.d. sample $z_1, \ldots, z_n$ from some unknown distribution (as in Section 2), and we would
like to find $w$ with low risk $L(w) = \mathbb{E}[\ell(w, Z)]$. When $z = (x, y)$ and $\ell(w, z) = \phi(\langle w, x\rangle, y)$ this agrees with
the supervised learning risk discussed in the Introduction and analyzed in Section 2. But instead of focusing
on the ERM, we run Mirror Descent (or Projected Online Gradient Descent in case of a Euclidean norm)
on the sample, and then take $\bar{w}_n = \frac{1}{n}\sum_{i=1}^{n}w_i$. Standard arguments [?] allow us to convert the online regret
bound of Theorem 2 to a bound on the excess risk:

Corollary 3. For any $B \in \mathbb{R}$ and $L^*$ if we run Mirror Descent on a random sample with stepsize
$\eta = \frac{1}{HB^2 + \sqrt{H^2B^4 + HB^2nL^*}}$ then for any $w^* \in \mathcal{W}$ with $F(w^*) \le B^2$ and $L(w^*) \le L^*$, in expectation over the
sample:

$$L(\bar{w}_n) - L(w^*) \le \frac{4HB^2}{n} + 2\sqrt{\frac{HB^2L^*}{n}}$$

Again, one must know a bound L* on the risk in order to choose the stepsize. 
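
The one-pass procedure can be sketched for the Euclidean case as follows. The sketch makes several assumptions that are not in the text: the objective is the squared loss $\frac{1}{2}(\langle w, x\rangle - y)^2$, the domain is $\mathcal{W} = \{w : \frac{1}{2}\|w\|^2 \le B^2\}$ (a ball of radius $\sqrt{2}B$), and the stepsize is the expression $1/(HB^2 + \sqrt{H^2B^4 + HB^2nL^*})$; all function names are hypothetical.

```python
import math


def stepsize(H, B, n, L_star):
    """Stepsize of the form 1 / (H B^2 + sqrt(H^2 B^4 + H B^2 n L*))."""
    return 1.0 / (H * B ** 2 + math.sqrt(H ** 2 * B ** 4 + H * B ** 2 * n * L_star))


def averaged_sgd(samples, H, B, L_star):
    """Single pass of projected SGD on 0.5*(<w,x> - y)^2; returns the average of w_1..w_n."""
    n = len(samples)
    dim = len(samples[0][0])
    eta = stepsize(H, B, n, L_star)
    radius = math.sqrt(2.0) * B               # assumed domain: 0.5*||w||^2 <= B^2
    w = [0.0] * dim
    avg = [0.0] * dim
    for x, y in samples:
        avg = [a + wi / n for a, wi in zip(avg, w)]   # record w_i before it is updated
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - eta * residual * xi for wi, xi in zip(w, x)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm > radius:                     # project back onto the ball
            w = [v * radius / norm for v in w]
    return avg


print(averaged_sgd([([1.0], 1.0), ([1.0], 1.0)], H=1.0, B=1.0, L_star=0.0))
```

Each example is touched exactly once, so the cost is linear in the sample size, and the returned point is the average of the iterates rather than the last iterate.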

It is instructive to contrast this guarantee with similar looking guarantees derived recently in the stochastic
convex optimization literature [14]. There, the model is stochastic first-order optimization, i.e. the learner
gets to see an unbiased estimate $\nabla\ell(w, z_i)$ of the gradient of $L(w)$. The variance of the estimate is assumed to
be bounded by $\sigma^2$. The expected accuracy after $n$ gradient evaluations then has two terms: an "accelerated"
term that is $O(H/n^2)$ and a slow $O(\sigma/\sqrt{n})$ term. While this result is applicable more generally (since it
doesn't require non-negativity of $\ell$), it is not immediately clear if our guarantees can be derived using it.
The main difficulty is that $\sigma$ depends on the norm of the gradient estimates. Thus, it cannot be bounded
in advance even if we know that $L(w^*)$ is small. That said, it is intuitively clear that towards the end of
the optimization process, the gradient norms will typically be small if $L(w^*)$ is small because of the self
bounding property (Lemma 3.1). Exploring this connection could be a fruitful direction for further research.

3.3 Stochastic Optimization II: Regularized Batch Optimization 

It is interesting to note that using stability arguments, a guarantee very similar to Corollary 3, avoiding the
polylogarithmic factors of Theorem 1 as well as the dependence on the bound on the loss ($b$ in Theorem 1), can
be obtained also for a "batch" learning rule similar to ERM, but incorporating penalty-type regularization.
For a given regularization parameter $\lambda > 0$ define the regularized empirical loss as

$$\hat{L}_\lambda(w) := \hat{L}(w) + \lambda F(w)$$

and consider the Regularized Empirical Risk Minimizer

$$\hat{w}_\lambda = \arg\min_{w \in \mathcal{W}}\hat{L}_\lambda(w) \qquad (14)$$
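
As a small worked instance of the regularized minimizer, take the one-dimensional squared objective $\ell(w, (x, y)) = \frac{1}{2}(wx - y)^2$ with $F(w) = \frac{1}{2}w^2$ and, as a simplifying assumption, an unconstrained domain. Setting the derivative of the regularized empirical loss to zero gives $\hat{w}_\lambda = \sum_i x_iy_i / (\sum_i x_i^2 + n\lambda)$. The sketch below (function names are hypothetical) implements the objective and this closed form.

```python
def regularized_objective(w, xs, ys, lam):
    """(1/n) * sum_i 0.5*(w*x_i - y_i)^2 + lam * 0.5 * w^2 for scalar w."""
    n = len(xs)
    empirical = sum(0.5 * (w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    return empirical + lam * 0.5 * w * w


def regularized_erm_1d(xs, ys, lam):
    """Closed-form minimizer of regularized_objective over all real w."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + n * lam)


xs, ys = [1.0, 1.0], [1.0, 3.0]
w_hat = regularized_erm_1d(xs, ys, lam=1.0)
print(w_hat, regularized_objective(w_hat, xs, ys, 1.0))
```

Larger values of the regularization parameter shrink the minimizer toward zero, which is the source of the stability used in the analysis that follows.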

The following theorem provides a bound on excess risk similar to Corollary 3:

Theorem 4. For any $B \in \mathbb{R}$ and $L^*$ if we set $\lambda = \frac{128H}{n} + \sqrt{\frac{128^2H^2}{n^2} + \frac{128HL^*}{nB^2}}$ then for all $w^* \in \mathcal{W}$ with
$F(w^*) \le B^2$ and $L(w^*) \le L^*$, we have that in expectation over sample of size $n$:

$$L(\hat{w}_\lambda) - L(w^*) \le \frac{256HB^2}{n} + \sqrt{\frac{2048HB^2L^*}{n}}.$$

To prove Theorem 4 we use stability arguments similar to the ones used by Shalev-Shwartz et al [25], which
are in turn based on Bousquet and Elisseeff [?]. However, while Shalev-Shwartz et al [25] use the notion
of uniform stability, here it is necessary to look at stability in expectation to get the faster rates (uniform
stability does not hold with the desired rate).

To use stability based arguments, for each $i \in [n]$ we consider a perturbed sample where instance $z_i$ is replaced
by instance $z_i'$ drawn independently from the same distribution as $z_i$. Let $\hat{L}^{(i)}(w) = \frac{1}{n}\left(\sum_{j \ne i}\ell(w, z_j) + \ell(w, z_i')\right)$
be the empirical risk over the perturbed sample, and consider the corresponding regularized empirical risk
minimizer $\hat{w}_\lambda^{(i)} = \arg\min_w\hat{L}_\lambda^{(i)}(w)$, where $\hat{L}_\lambda^{(i)}(w) = \hat{L}^{(i)}(w) + \lambda F(w)$. We first prove the following Lemma
on the expected stability of the regularized minimizer:

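Lemma 3.2. For any $i \in [n]$ we have that

$$\mathbb{E}_{z_1, \ldots, z_n, z_i'}\left[\ell(\hat{w}_\lambda^{(i)}, z_i) - \ell(\hat{w}_\lambda, z_i)\right] \le \frac{32H}{\lambda n}\,\mathbb{E}_{z_1, \ldots, z_n}\left[L(\hat{w}_\lambda)\right]$$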



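Proof.

$$\hat{L}_\lambda(\hat{w}_\lambda^{(i)}) - \hat{L}_\lambda(\hat{w}_\lambda) = \frac{\ell(\hat{w}_\lambda^{(i)}, z_i) - \ell(\hat{w}_\lambda, z_i)}{n} + \frac{\ell(\hat{w}_\lambda, z_i') - \ell(\hat{w}_\lambda^{(i)}, z_i')}{n} + \hat{L}_\lambda^{(i)}(\hat{w}_\lambda^{(i)}) - \hat{L}_\lambda^{(i)}(\hat{w}_\lambda)$$
$$\le \frac{\ell(\hat{w}_\lambda^{(i)}, z_i) - \ell(\hat{w}_\lambda, z_i)}{n} + \frac{\ell(\hat{w}_\lambda, z_i') - \ell(\hat{w}_\lambda^{(i)}, z_i')}{n}$$
$$\le \frac{1}{n}\|\hat{w}_\lambda^{(i)} - \hat{w}_\lambda\|\left(\|\nabla\ell(\hat{w}_\lambda^{(i)}, z_i)\|_* + \|\nabla\ell(\hat{w}_\lambda, z_i')\|_*\right)$$
$$\le \frac{2\sqrt{H}}{n}\|\hat{w}_\lambda^{(i)} - \hat{w}_\lambda\|\left(\sqrt{\ell(\hat{w}_\lambda^{(i)}, z_i)} + \sqrt{\ell(\hat{w}_\lambda, z_i')}\right)$$

where the last inequality follows from Lemma 3.1. By $\lambda$-strong convexity of $\hat{L}_\lambda$ we have that

$$\hat{L}_\lambda(\hat{w}_\lambda^{(i)}) - \hat{L}_\lambda(\hat{w}_\lambda) \ge \frac{\lambda}{2}\|\hat{w}_\lambda^{(i)} - \hat{w}_\lambda\|^2.$$

We can conclude that

$$\|\hat{w}_\lambda^{(i)} - \hat{w}_\lambda\| \le \frac{4\sqrt{H}}{\lambda n}\left(\sqrt{\ell(\hat{w}_\lambda^{(i)}, z_i)} + \sqrt{\ell(\hat{w}_\lambda, z_i')}\right)$$

This gives us:

$$\ell(\hat{w}_\lambda^{(i)}, z_i) - \ell(\hat{w}_\lambda, z_i) \le \|\nabla\ell(\hat{w}_\lambda^{(i)}, z_i)\|_*\,\|\hat{w}_\lambda^{(i)} - \hat{w}_\lambda\| \le \frac{16H}{\lambda n}\left(\ell(\hat{w}_\lambda^{(i)}, z_i) + \ell(\hat{w}_\lambda, z_i')\right)$$

Taking expectation:

$$\mathbb{E}_{z_1, \ldots, z_n, z_i'}\left[\ell(\hat{w}_\lambda^{(i)}, z_i) - \ell(\hat{w}_\lambda, z_i)\right] \le \frac{16H}{\lambda n}\,\mathbb{E}_{z_1, \ldots, z_n, z_i'}\left[\ell(\hat{w}_\lambda^{(i)}, z_i) + \ell(\hat{w}_\lambda, z_i')\right]$$
$$= \frac{16H}{\lambda n}\,\mathbb{E}_{z_1, \ldots, z_n, z_i'}\left[L(\hat{w}_\lambda^{(i)}) + L(\hat{w}_\lambda)\right] = \frac{32H}{\lambda n}\,\mathbb{E}_{z_1, \ldots, z_n}\left[L(\hat{w}_\lambda)\right]$$

□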



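Proof of Theorem 4. Write $L_\lambda(w) = L(w) + \lambda F(w)$. By Lemma 3.2 we have:

$$\mathbb{E}_{z_1, \ldots, z_n}\left[L_\lambda(\hat{w}_\lambda) - \hat{L}_\lambda(w^*)\right] \le \mathbb{E}_{z_1, \ldots, z_n}\left[L_\lambda(\hat{w}_\lambda) - \hat{L}_\lambda(\hat{w}_\lambda)\right] = \mathbb{E}_{z_1, \ldots, z_n}\left[L(\hat{w}_\lambda) - \hat{L}(\hat{w}_\lambda)\right] \le \frac{32H}{\lambda n}\,\mathbb{E}_{z_1, \ldots, z_n}\left[L(\hat{w}_\lambda)\right]$$

Noting the definition of $L_\lambda(w)$ and rearranging we get

$$\mathbb{E}_{z_1, \ldots, z_n}\left[L(\hat{w}_\lambda) - L(w^*)\right] \le \frac{32H}{\lambda n}\,\mathbb{E}_{z_1, \ldots, z_n}\left[L(\hat{w}_\lambda)\right] + \lambda F(w^*) - \lambda\,\mathbb{E}\left[F(\hat{w}_\lambda)\right] \le \frac{32H}{\lambda n}\,\mathbb{E}_{z_1, \ldots, z_n}\left[L(\hat{w}_\lambda)\right] + \lambda F(w^*)$$

Rearranging further we get

$$\mathbb{E}_{z_1, \ldots, z_n}\left[L(\hat{w}_\lambda)\right] - L(w^*) \le \left(\frac{1}{1 - \frac{32H}{\lambda n}} - 1\right)L(w^*) + \frac{1}{1 - \frac{32H}{\lambda n}}\,\lambda F(w^*)$$

Plugging in the value of $\lambda$ gives the result. □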



4 Tightness 



In this Section we return to the learning rates for the ERM for parametric and for scale-sensitive hypothesis
classes (i.e. in terms of the dimensionality and in terms of scale sensitive complexity measures), discussed in
the Introduction and analyzed in Section 2. We compare the guarantees on the learning rates in different
situations, identify differences between the parametric and scale-sensitive cases and between the smooth and
non-smooth cases, and argue that these differences are real by showing that the corresponding guarantees
are tight. Although we discuss the tightness of the learning guarantees for ERM in the stochastic setting,
similar arguments can also be made for online learning.

Table 1 summarizes the bounds on the excess risk of the ERM implied by Theorem 1 as well as previous bounds
for Lipschitz loss on finite-dimensional [23] and scale-sensitive [?] classes, and a bound for squared-loss on
finite-dimensional classes [?, Theorem 11.7] that can be generalized to any smooth strongly convex loss.



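Loss function is:                            Parametric                               Scale-Sensitive
                                             ($\dim(\mathcal{H}) \le d$, $|h| \le 1$)

$D$-Lipschitz                                $\frac{dD}{n} + \sqrt{\frac{dDL^*}{n}}$                 $\sqrt{\frac{D^2R}{n}}$

$H$-smooth                                   $\frac{dH}{n} + \sqrt{\frac{dHL^*}{n}}$                 $\frac{HR}{n} + \sqrt{\frac{HRL^*}{n}}$

$H$-smooth and $\lambda$-strongly convex     $\frac{H}{\lambda}\frac{dH}{n}$                         $\frac{HR}{n} + \sqrt{\frac{HRL^*}{n}}$

Table 1: Bounds on the excess risk, up to polylogarithmic factors.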



We shall now show that the $1/\sqrt{n}$ dependencies in Table 1 are unavoidable. To do so, we will consider the
class $\mathcal{H} = \{x \mapsto \langle w, x\rangle : \|w\| \le 1\}$ of $\ell_2$-bounded linear predictors (all norms in this Section are Euclidean),
with different loss functions, and various specific distributions over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \le 1\}$
and $\mathcal{Y} = [0, 1]$. For the non-parametric lower-bounds, we will allow the dimensionality $d$ to grow with the
sample size $n$.

Infinite dimensional, Lipschitz (non-smooth), separable 

Consider the absolute difference loss $\phi(h(x), y) = |h(x) - y|$, take $d = 2n$ and consider the following
distribution: $X$ is uniformly distributed over the $d$ standard basis vectors $e_i$ and if $X = e_i$, then $Y = r_i/\sqrt{d}$, where
$r_1, \ldots, r_d \in \{\pm 1\}$ is an arbitrary sequence of signs unknown to the learner (say drawn randomly beforehand).
Taking $w^* = \frac{1}{\sqrt{d}} \sum_{i=1}^{d} r_i e_i$, we have $\|w^*\| = 1$ and $L^* = L(w^*) = 0$. However, any sample $(x_1, y_1), \ldots, (x_n, y_n)$ reveals
at most $n$ of the $2n$ signs $r_i$, and no information on the remaining $\ge n$ signs. This means that for any algorithm
used by the learner, there exists a choice of the $r_i$'s such that on at least $n$ of the remaining points not seen by
the learner, the learner has to suffer a loss of at least $1/\sqrt{d}$, yielding an overall risk of at least $1/(2\sqrt{d})$, which is of order $1/\sqrt{n}$.
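The counting argument above can be checked numerically. The following is only a minimal sketch under stated assumptions: it uses a hypothetical learner that fits every observed coordinate exactly and predicts 0 on unseen ones, and the helper name `unseen_risk` is ours. It compares the resulting risk with $1/(2\sqrt{d})$, the value obtained when at least $n$ of the $d = 2n$ coordinates each cost $1/\sqrt{d}$.

```python
import math
import random

def unseen_risk(n, seed=0):
    """Risk of a learner that memorises seen coordinates and predicts 0 elsewhere.

    Construction from the text: d = 2n, X uniform over e_1..e_d, Y = r_i / sqrt(d).
    """
    rng = random.Random(seed)
    d = 2 * n
    # Coordinates revealed by an i.i.d. sample of size n (at most n distinct ones).
    seen = {rng.randrange(d) for _ in range(n)}
    unseen = d - len(seen)  # at least n coordinates stay hidden
    # On an unseen coordinate the prediction is 0, so the loss is |0 - r_i/sqrt(d)| = 1/sqrt(d).
    # Seen coordinates are fit exactly and contribute no loss.
    return (unseen / d) * (1.0 / math.sqrt(d))

for n in (10, 100, 1000):
    d = 2 * n
    print(n, unseen_risk(n), 1.0 / (2.0 * math.sqrt(d)))
```

The risk never drops below $1/(2\sqrt{d})$ even though $L^* = 0$, which is the non-smooth, separable slow rate the construction is meant to exhibit.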

Infinite dimensional, smooth, non-separable, even if strongly convex 

Consider the squared loss $\phi(h(x), y) = (h(x) - y)^2$, which is 2-smooth and 2-strongly convex. For any $\sigma > 0$
let $d = \sqrt{n}/\sigma$ and consider the following distribution: $X$ is uniform over the $e_i$ as before, but this time $Y|X$ is
random, with $Y|(X = e_i) \sim \mathcal{N}(\frac{r_i}{2\sqrt{d}}, \sigma^2)$, where again the $r_i$ are pre-determined random
signs, unknown to the learner. The minimizer of the expected risk is $w^* = \sum_{i=1}^{d} \frac{r_i}{2\sqrt{d}} e_i$, with $\|w^*\| = \frac{1}{2}$ and $L^* = L(w^*) = \sigma^2$.
Furthermore, for any $w \in \mathcal{W}$,

\[ L(w) - L(w^*) = \mathbb{E}\left[\langle w - w^*, x\rangle^2\right] = \frac{1}{d} \sum_{i=1}^{d} (w[i] - w^*[i])^2 = \frac{1}{d} \|w - w^*\|^2 . \]

If the norm constraint becomes tight, i.e. $\|\hat{w}\| = 1$, then $\|\hat{w} - w^*\| \ge 1/2$ and so $L(\hat{w}) - L(w^*) \ge 1/(4d) = \sigma/(4\sqrt{n}) = \sqrt{L^*}/(4\sqrt{n})$.
Otherwise, each coordinate is a separate mean estimation problem with $n_i$ samples, where $n_i$ is the number
of appearances of $e_i$ in the sample. We have $\mathbb{E}[(\hat{w}[i] - w^*[i])^2] = \sigma^2/n_i$ and so



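\[ \mathbb{E}\left[L(\hat{w})\right] - L^* = \frac{1}{d} \sum_{i=1}^{d} \frac{\sigma^2}{n_i} \ge \frac{\sigma^2 d}{n} = \frac{\sigma}{\sqrt{n}} = \sqrt{\frac{L^*}{n}} , \]

where the inequality holds because $\sum_i n_i = n$.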




Finite dimensional, smooth, not strongly convex, non-separable: 

Take $d = 1$, with $X = 1$ with probability $q$ and $X = 0$ with probability $1 - q$. Conditioned on $X = 0$ let
$Y = 0$ deterministically, while conditioned on $X = 1$ let $Y = +1$ with probability $p = \frac{1}{2} + \frac{0.2}{\sqrt{qn}}$ and $Y = -1$
with probability $1 - p$. Consider the following 2-smooth loss function, which is quadratic around the correct
prediction, but linear away from it:

\[ \phi(h(x), y) = \begin{cases} (h(x) - y)^2 & \text{if } |h(x) - y| \le 1/2 \\ |h(x) - y| - 1/4 & \text{if } |h(x) - y| \ge 1/2 \end{cases} \]



First note that irrespective of the choice of $w$, when $x = 0$ (and so $y = 0$) we always have $h(x) = 0$ and so
suffer no loss. This happens with probability $1 - q$. Next observe that for $p > 1/2$, the optimal predictor is
$w^* \ge 1/2$. However, for $n > 20$, with probability at least 0.25, $\sum_{i=1}^{n} y_i < 0$, and so the empirical minimizer
is $\hat{w} \le -1/2$. We can now calculate

\[ L(\hat{w}) - L^* > L(-1/2) - L(1/2) = q(2p - 1) + (1 - q) \cdot 0 = \frac{0.4\, q}{\sqrt{qn}} = 0.4 \sqrt{\frac{q}{n}} . \]

However note that for $p > 1/2$, $w^* = \frac{3}{2} - \frac{1}{2p}$ and so for $n > 20$:
Hence we conclude that with probability 0.25 over the sample,

\[ L(\hat{w}) - L^* > \sqrt{\frac{0.32\, L^*}{n}} . \]
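The two closed-form facts used in this construction, $L(-1/2) - L(1/2) = q(2p - 1)$ and $w^* = \frac{3}{2} - \frac{1}{2p}$, can be sanity-checked with a short script. This is a sketch under stated assumptions: the quadratic branch of the loss is taken to be $(h(x) - y)^2$, the values of `p` and `q` are arbitrary illustrative choices, and the function names are ours.

```python
def phi(pred, y):
    """Loss that is quadratic within 1/2 of the target and linear (slope 1) beyond it."""
    gap = abs(pred - y)
    return gap * gap if gap <= 0.5 else gap - 0.25

def risk(w, p, q):
    """Expected loss of h(x) = w * x.

    X = 1 with probability q (then Y = +1 w.p. p, Y = -1 otherwise);
    X = 0 with probability 1 - q (then Y = 0 and the prediction is 0).
    """
    return q * (p * phi(w, 1.0) + (1 - p) * phi(w, -1.0)) + (1 - q) * phi(0.0, 0.0)

p, q = 0.6, 0.3  # illustrative values with p > 1/2

# Gap between the "wrong side" predictor -1/2 and the predictor 1/2.
gap = risk(-0.5, p, q) - risk(0.5, p, q)

# Closed-form minimizer versus a grid search over w in [-1, 1].
w_star = 1.5 - 1.0 / (2 * p)
grid = [i / 1000.0 - 1.0 for i in range(2001)]
w_grid = min(grid, key=lambda w: risk(w, p, q))

print(gap, q * (2 * p - 1))
print(w_star, w_grid)
```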



5 Implications 

We demonstrate the implications of our results in several settings. 



5.1 Improved Margin Bounds

"Margin bounds" provide a bound on the expected zero-one loss of a classifier based on the margin zero-one
error on the training sample. Koltchinskii and Panchenko [13] provide margin bounds for a generic class
$\mathcal{H}$ based on the Rademacher complexity of the class. This is done by using a non-smooth Lipschitz "ramp"
loss that upper bounds the zero-one loss and is upper-bounded by the margin zero-one loss. However, such
an analysis unavoidably leads to a $1/\sqrt{n}$ rate even in the separable case, since, as we discuss in the preceding section,
it is not possible to get a faster rate for a non-smooth loss. Following the same idea we use the following
smooth "ramp":

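\[ \phi(t) = \begin{cases} 1 & t \le 0 \\ \frac{1 + \cos(\pi t / \gamma)}{2} & 0 < t < \gamma \\ 0 & t \ge \gamma \end{cases} \]

This loss function is $\frac{\pi^2}{2\gamma^2}$-smooth and is lower bounded by the zero-one loss and upper bounded by the $\gamma$-margin
loss. Using Theorem [1] we can now provide improved margin bounds for the zero-one loss of any
classifier based on empirical margin error. Denote by $\mathrm{err}(h) = \mathbb{E}[\mathbf{1}_{\{h(x) \ne y\}}]$ the zero-one risk, and for any $\gamma > 0$
and sample $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \{\pm 1\}$ define the $\gamma$-margin empirical zero-one loss as

\[ \widehat{\mathrm{err}}_\gamma(h) := \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{y_i h(x_i) < \gamma\}} . \]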
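To make the sandwich between the zero-one loss and the $\gamma$-margin loss concrete, here is a minimal sketch. It assumes a cosine interpolation for the middle piece of the smooth ramp (the exact middle case is an assumption on our part), applies all three losses to the margin $t = y\,h(x)$, and uses function names of our own choosing.

```python
import math

def smooth_ramp(t, gamma):
    """Smooth ramp on the margin t: 1 for t <= 0, 0 for t >= gamma.

    The cosine interpolation in between is an assumed form.
    """
    if t <= 0:
        return 1.0
    if t >= gamma:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * t / gamma))

def zero_one(t):
    """Zero-one loss on the margin: an error whenever t <= 0."""
    return 1.0 if t <= 0 else 0.0

def margin_loss(t, gamma):
    """gamma-margin zero-one loss: an error whenever t < gamma."""
    return 1.0 if t < gamma else 0.0

def empirical_margin_error(scores, labels, gamma):
    """Fraction of examples with y_i * h(x_i) < gamma."""
    return sum(margin_loss(y * s, gamma) for s, y in zip(scores, labels)) / len(scores)

gamma = 0.5
for t in (-0.3, 0.0, 0.1, 0.25, 0.5, 0.8):
    print(t, zero_one(t), smooth_ramp(t, gamma), margin_loss(t, gamma))
```

For example, `empirical_margin_error([0.2, -0.1, 0.9, 0.6], [1, 1, 1, -1], 0.5)` counts three of the four examples as margin errors.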

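Theorem 5. For any hypothesis class $\mathcal{H}$, with $|h| \le b$, and any $\delta > 0$, with probability at least $1 - \delta$,
simultaneously for all margins $\gamma > 0$ and all $h \in \mathcal{H}$:

\[ \mathrm{err}(h) \le \widehat{\mathrm{err}}_\gamma(h) + K \left( \sqrt{\widehat{\mathrm{err}}_\gamma(h)} \left( \frac{\log^{1.5} n}{\gamma} \mathcal{R}_n(\mathcal{H}) + \sqrt{\frac{\log(\log(\frac{4b}{\gamma})/\delta)}{n}} \right) + \frac{\log^3 n}{\gamma^2} \mathcal{R}_n^2(\mathcal{H}) + \frac{\log(\log(\frac{4b}{\gamma})/\delta)}{n} \right) \]

where $K$ is a numeric constant from Theorem [7]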



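In particular, the above bound implies:

\[ \mathrm{err}(h) \le 1.01\, \widehat{\mathrm{err}}_\gamma(h) + K \left( \frac{2 \log^3 n}{\gamma^2} \mathcal{R}_n^2(\mathcal{H}) + \frac{2 \log(\log(\frac{4b}{\gamma})/\delta)}{n} \right) \]

where $K$ is an appropriate numeric constant.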

Improved margin bounds of the above form have been previously shown specifically for linear prediction
in a Hilbert space (as in Support Vector Machines) based on the PAC-Bayes theorem [20]. However,
these PAC-Bayes based results are specific to the linear function class. Theorem 5 is, in contrast, a generic
concentration-based result that can be applied to any function class and yields rates dominated by
$\mathcal{R}_n^2(\mathcal{H})$.



5.2 Interaction of Norm and Dimension 



Consider the problem of learning a low-norm linear predictor with respect to the squared loss $\phi(t, z) = (t - z)^2$,
where $X \in \mathbb{R}^d$, for finite but very large $d$, and where the expected norm of $X$ is low. Specifically, let $X$
be Gaussian with $\mathbb{E}\|X\|^2 = B$, $Y = \langle w^*, X\rangle + \mathcal{N}(0, \sigma^2)$ with $\|w^*\| = 1$, and consider learning a linear
predictor using $\ell_2$ regularization. What determines the sample complexity? How does the error decrease as
the sample size increases?



From a scale-sensitive statistical learning perspective, we expect that the sample complexity, and the decrease
of the error, should depend on the norm $B$, especially if $d \gg B^2$. However, for any fixed $d$ and $B$, even if
$d \gg B^2$, asymptotically as the number of samples increases, the excess risk of norm-constrained or
norm-regularized regression actually behaves as $L(\hat{w}) - L^* \approx \frac{d}{n}\sigma^2$, and depends (to first order) only on the
dimensionality $d$ and not at all on $B$ [17]. How does the scale-sensitive complexity come into play?

The asymptotic dependence on the dimensionality alone can be understood through Table 1. In this
non-separable situation, parametric complexity controls can lead to a $1/n$ rate, ultimately dominating the $\sqrt{L^*/n}$
rate resulting from $L^* > 0$ when considering the scale-sensitive, non-parametric complexity control $B$. (The
dimension-dependent behavior here is actually a bit better than in the generic situation: the well-posed
Gaussian model allows the bound to depend on $\sigma^2 = L^*$ rather than on $\sup(\langle w, x\rangle - y)^2 \approx B^2 + \sigma^2$.)

Combining Theorem [2] with the asymptotic $\frac{d}{n}\sigma^2$ behavior, and noting that in the worst case we can predict
using a zero vector, yields the following overall picture of the expected excess risk of ridge regression with
an optimally chosen $\lambda$:

\[ L(\hat{w}_\lambda) - L^* \le O\left( \min\left( B^2,\; \frac{B^2}{n} + \frac{B\sigma}{\sqrt{n}},\; \frac{d\sigma^2}{n} \right) \right) \]

Roughly speaking, each term above describes the behavior in a different regime of the sample size:

• The first ("random") regime, until $n = \Theta(B^2)$, where the excess risk is $B^2$.

• The second ("low-noise") regime, where the excess risk is dominated by the norm and behaves as $B^2/n$,
until $n = \Theta(B^2/\sigma^2)$ and $L(\hat{w}) = \Theta(L^*)$.

• The third ("slow") regime, where the excess risk is controlled by the norm and the approximation error
and behaves as $B\sigma/\sqrt{n}$, until $n = \Theta(d^2\sigma^2/B^2)$ and $L(\hat{w}) = L^* + \Theta(B^2/d)$.

• The fourth ("asymptotic") regime, where the excess risk is dominated by the dimensionality and
behaves as $d/n$.
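The regimes can be explored by evaluating the three terms $B^2$, $B^2/n + B\sigma/\sqrt{n}$ and $d\sigma^2/n$ directly. The sketch below drops all constants hidden by the $O(\cdot)$, so the crossover points it reports are only indicative; the function names and the example values of `B`, `sigma` and `d` are our own illustrative choices.

```python
import math

def excess_risk_terms(n, B, sigma, d):
    """The three candidate bounds, ignoring constants hidden by O(.)."""
    return {
        "random": B ** 2,
        "norm": B ** 2 / n + B * sigma / math.sqrt(n),
        "asymptotic": d * sigma ** 2 / n,
    }

def regime(n, B, sigma, d):
    """Name of the regime whose term attains the minimum at sample size n."""
    terms = excess_risk_terms(n, B, sigma, d)
    name = min(terms, key=terms.get)
    if name == "norm":
        # Split the norm-controlled term into its B^2/n and B*sigma/sqrt(n) parts.
        name = "low-noise" if B ** 2 / n >= B * sigma / math.sqrt(n) else "slow"
    return name

B, sigma, d = 10.0, 0.1, 10 ** 6
for n in (1, 100, 10 ** 6, 10 ** 10):
    print(n, regime(n, B, sigma, d), min(excess_risk_terms(n, B, sigma, d).values()))
```

With these values the low-noise and slow parts cross at $n = B^2/\sigma^2 = 10^4$ and the slow and asymptotic terms cross near $n = d^2\sigma^2/B^2 = 10^8$.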

This sheds further light on recent work on this phenomenon by Liang and Srebro based on exact asymptotics
of simplified situations [18].



5.3 Sparse Prediction 



The use of the $\ell_1$ norm has become very popular for learning sparse predictors in high dimensions, as in the
LASSO. The LASSO estimator [30] $\hat{w}$ is obtained by considering the squared loss $\phi(z, y) = (z - y)^2$ and
minimizing $\hat{L}(w)$ subject to $\|w\|_1 \le B$. Let us assume there is some (unknown) sparse reference predictor
$w^0$ that has low expected loss and sparsity (number of non-zeros) $\|w^0\|_0 = k$, and that $\|x\|_\infty \le 1$, $|y| \le 1$. In
order to choose $B$ and apply Theorem [1] in this setting, we need to bound $\|w^0\|_1$. This can be done by, e.g.,
assuming that the features $x[i]$ in the support of $w^0$ are mutually uncorrelated. Under such an assumption,
we have: $\|w^0\|_1^2 \le k\, \mathbb{E}[\langle w^0, x\rangle^2] \le 2k(L(w^0) + \mathbb{E}[y^2]) \le 4k$. Thus, Theorem [1] along with Rademacher
complexity bounds from [11] gives us,

\[ L(\hat{w}) \le L(w^0) + \tilde{O}\left( \frac{k \log(d)}{n} + \sqrt{\frac{k L(w^0) \log(d)}{n}} \right) \qquad (15) \]



It is possible to relax the no-correlation assumption to a bound on the correlations, as in mutual incoherence,
or to other weaker conditions [26]. But in any case, unlike typical analysis for compressed sensing, where
the goal is recovering $w^0$ itself, here we are only concerned with correlations inside the support of $w^0$.
Furthermore, we do not need to require that the optimal predictor is sparse or close to being sparse, or that
the model is well specified: only that there exists a good (low risk) predictor using a small number of fairly
uncorrelated features.

Bounds similar to (15) have been derived using specialized arguments 12, 32, B|; here we demonstrate that
a simple form of these bounds can be obtained under very simple conditions, using the generic framework
we suggest.

It is also interesting to note that the methods and results of Section [3] can also be applied to this setting.
But since $\|w\|_1^2$ is not strongly convex with respect to $\|w\|_1$, we must instead use the entropy regularizer

\[ F(w) = B \sum_{i} w[i] \log\left( \frac{w[i]}{B/d} \right) + \frac{B^2}{e} \qquad (16) \]

which is non-negative and 1-strongly convex with respect to $\|w\|_1$ on $\mathcal{W} = \{w \in \mathbb{R}^d \mid w[i] \ge 0, \|w\|_1 \le B\}$,
with $F(w) \le B^2(1 + \log d)$ (we consider here only non-negative weights; in order to allow $w[i] < 0$ we can
also include each feature's negation, doubling the dimensionality). Recalling that $\|w^0\|_1 \le 2\sqrt{k}$ and using
$B = 2\sqrt{k}$ in (16), we have from Theorem U] that:

\[ L(\hat{w}_\lambda) \le L(w^0) + O\left( \frac{k \log d}{n} + \sqrt{\frac{k L(w^0) \log d}{n}} \right) \qquad (17) \]

where $\hat{w}_\lambda$ is the regularized empirical minimizer (14) using the entropy regularizer (16) with $\lambda$ as in Theorem
SJ. The advantage here is that using Theorem @] instead of Theorem [1] avoids the extra logarithmic factors
(yielding a clean big-$O$ dependence in (17), as opposed to the big-$\tilde{O}$ in (15)).

More interestingly, following Corollary [3] one can use stochastic mirror descent, taking steps of the form
([T^l with the entropy regularizer (16), to obtain the same performance guarantee as in (17). This provides
an efficient, single-pass optimization approach to sparse prediction as an alternative to batch optimization
with an $\ell_1$-norm constraint, yielding the same (if not somewhat better) guarantees.
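A single-pass procedure of this kind might look as follows. This is only a sketch under stated assumptions, not the exact update analyzed in the paper: it uses the standard multiplicative (exponentiated-gradient) step induced by an entropy regularizer on non-negative weights, rescales to enforce $\|w\|_1 \le B$, averages the iterates, and takes a fixed step size `step` chosen by hand. The function name and all parameter values are ours.

```python
import numpy as np

def entropic_mirror_descent(X, y, B, step, seed=0):
    """Single pass of stochastic mirror descent for the squared loss.

    Iterates stay in W = {w >= 0, ||w||_1 <= B}. The multiplicative update is
    the mirror step induced by an entropy regularizer; rescaling is the
    matching Bregman projection back onto the l1 ball.
    """
    n, d = X.shape
    w = np.full(d, B / d)                # uniform starting point inside W
    avg = np.zeros(d)
    order = np.random.default_rng(seed).permutation(n)
    for i in order:
        grad = 2.0 * (X[i] @ w - y[i]) * X[i]   # gradient of (<w, x> - y)^2
        w = w * np.exp(-step * grad / B)        # multiplicative mirror step
        l1 = w.sum()
        if l1 > B:
            w = w * (B / l1)                    # project back onto ||w||_1 <= B
        avg += w
    return avg / n                              # averaged iterate

# Usage on synthetic data with a 2-sparse, non-negative reference predictor.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(2000, 20))
w0 = np.zeros(20)
w0[0], w0[1] = 1.0, 0.5
y = X @ w0
w_hat = entropic_mirror_descent(X, y, B=2.0, step=0.05)
print(np.round(w_hat[:4], 3), float(np.mean((X @ w_hat - y) ** 2)))
```

Each example is touched once, so the cost is linear in the sample size and dimension, in contrast to repeatedly solving a constrained batch problem.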
A Technical proofs



Lemma A.1. For any $H$-smooth non-negative function $f : \mathbb{R} \to \mathbb{R}$ and any $t, r \in \mathbb{R}$ we have that

\[ (f(t) - f(r))^2 \le 6H (f(t) + f(r)) (t - r)^2 . \]

Proof. We start by noting that by the mean value theorem, for any $t, r \in \mathbb{R}$ there exists $s$ between $t$ and $r$
such that

\[ f(t) - f(r) = f'(s)(t - r) . \qquad (18) \]

By smoothness we have that

\[ |f'(s) - f'(t)| \le H |t - s| \le H |t - r| . \]

Hence we see that

\[ |f'(s)| \le |f'(t)| + H |t - r| . \qquad (19) \]

We now consider two cases.

Case I: If $|t - r| \le \frac{|f'(t)|}{5H}$ then by Equation (19), $|f'(s)| \le \frac{6}{5} |f'(t)|$, and combining this with Equation
(18) we have:

\[ (f(t) - f(r))^2 \le f'(s)^2 (t - r)^2 \le \frac{36}{25} f'(t)^2 (t - r)^2 . \]

But Lemma lO ensures $f'(t)^2 \le 4H f(t)$, yielding:

\[ (f(t) - f(r))^2 \le \frac{144}{25} H f(t) (t - r)^2 < 6H f(t) (t - r)^2 . \qquad (20) \]

Case II: On the other hand, when $|t - r| > \frac{|f'(t)|}{5H}$, we have from Equation (19) that $|f'(s)| \le 6H |t - r|$.
Plugging this into Equation (18) yields:

\[
\begin{aligned}
(f(t) - f(r))^2 &= |f(t) - f(r)| \cdot |f(t) - f(r)| \le |f(t) - f(r)| \left( |f'(s)|\, |t - r| \right) \\
&\le |f(t) - f(r)| \left( 6H |t - r| \cdot |t - r| \right) = 6H |f(t) - f(r)| (t - r)^2 \\
&\le 6H \max\{f(t), f(r)\} (t - r)^2 . \qquad (21)
\end{aligned}
\]

Combining the two cases, we have from Equations (20) and (21) and the non-negativity of $f(\cdot)$ that in
either case:

\[ (f(t) - f(r))^2 \le 6H (f(t) + f(r)) (t - r)^2 . \qquad \Box \]
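The inequality of Lemma A.1 is easy to probe numerically. The sketch below evaluates the ratio of the two sides on a grid for two illustrative non-negative smooth functions of our own choosing: the square $t^2$ (which is 2-smooth) and the softplus $\log(1 + e^t)$ (which is $\frac{1}{4}$-smooth). The helper name `worst_ratio` is hypothetical, and a grid check is of course not a proof.

```python
import math

def worst_ratio(f, H, points):
    """Largest value of (f(t)-f(r))^2 / (6H (f(t)+f(r)) (t-r)^2) over pairs t != r."""
    worst = 0.0
    for t in points:
        for r in points:
            if t == r:
                continue
            lhs = (f(t) - f(r)) ** 2
            rhs = 6.0 * H * (f(t) + f(r)) * (t - r) ** 2
            worst = max(worst, lhs / rhs)
    return worst

def square(t):
    return t * t                       # second derivative 2, so H = 2

def softplus(t):
    return math.log1p(math.exp(t))     # second derivative at most 1/4, so H = 1/4

grid = [i / 10.0 for i in range(-50, 51)]
print(worst_ratio(square, 2.0, grid))
print(worst_ratio(softplus, 0.25, grid))
```

Both printed ratios stay below 1, as the lemma requires.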
Relating Fat-shattering Dimension and Rademacher complexity:

The following lemma upper bounds the fat-shattering dimension at scale $\epsilon > \mathcal{R}_n(\mathcal{H})$ in terms of the
Rademacher complexity of the function class. The proof closely follows the arguments of Mendelson [21,
discussion after Definition 4.2].

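Lemma A.2. For any hypothesis class $\mathcal{H}$, any sample size $n$ and any $\epsilon > \mathcal{R}_n(\mathcal{H})$ we have that

\[ \mathrm{fat}_\epsilon(\mathcal{H}) \le \frac{4 n\, \mathcal{R}_n(\mathcal{H})^2}{\epsilon^2} . \]

In particular, if $\mathcal{R}_n(\mathcal{H}) = \sqrt{R/n}$ (the typical case), then $\mathrm{fat}_\epsilon(\mathcal{H}) \le 4R/\epsilon^2$.

Proof. Consider any $\epsilon > \mathcal{R}_n(\mathcal{H})$. Let $x_1^*, \ldots, x_{\mathrm{fat}_\epsilon}^*$ be the set of $\mathrm{fat}_\epsilon$ shattered points. This means that there
exist $s_1, \ldots, s_{\mathrm{fat}_\epsilon}$ such that for any $J \subseteq [\mathrm{fat}_\epsilon]$ there exists $h_J \in \mathcal{H}$ such that $\forall i \in J$, $h_J(x_i^*) \ge s_i + \epsilon$ and
$\forall i \notin J$, $h_J(x_i^*) \le s_i - \epsilon$. Write $m = \lceil n/\mathrm{fat}_\epsilon \rceil$ and consider a sample $x_1, \ldots, x_{n'}$ of size $n' = m\, \mathrm{fat}_\epsilon$, obtained by taking each
$x_i^*$ and repeating it $m$ times, i.e. $x_i = x^*_{\lceil i/m \rceil}$. Now, following Mendelson's arguments:

\[
\begin{aligned}
\mathcal{R}_{n'}(\mathcal{H}) &\ge \mathbb{E}_{\sigma \sim \mathrm{Unif}\{\pm 1\}^{n'}} \left[ \frac{1}{n'} \sup_{h \in \mathcal{H}} \left| \sum_{i=1}^{n'} \sigma_i h(x_i) \right| \right] \\
&\ge \frac{1}{2} \mathbb{E}_{\sigma \sim \mathrm{Unif}\{\pm 1\}^{n'}} \left[ \frac{1}{n'} \sup_{h, h' \in \mathcal{H}} \left| \sum_{i=1}^{n'} \sigma_i \left( h(x_i) - h'(x_i) \right) \right| \right] \qquad \text{(triangle inequality)} \\
&= \frac{1}{2} \mathbb{E}_{\sigma \sim \mathrm{Unif}\{\pm 1\}^{n'}} \left[ \frac{1}{n'} \sup_{h, h' \in \mathcal{H}} \left| \sum_{i=1}^{\mathrm{fat}_\epsilon} \left( \sum_{j=1}^{m} \sigma_{(i-1)m+j} \right) \left( h(x_i^*) - h'(x_i^*) \right) \right| \right] \\
&\ge \frac{1}{2} \mathbb{E}_{\sigma \sim \mathrm{Unif}\{\pm 1\}^{n'}} \left[ \frac{1}{n'} \left| \sum_{i=1}^{\mathrm{fat}_\epsilon} \left( \sum_{j=1}^{m} \sigma_{(i-1)m+j} \right) \left( h_R(x_i^*) - h_{\bar{R}}(x_i^*) \right) \right| \right]
\end{aligned}
\]

where for each $\sigma_1, \ldots, \sigma_{n'}$, $R \subseteq [\mathrm{fat}_\epsilon]$ is given by $R = \left\{ i \in [\mathrm{fat}_\epsilon] : \mathrm{sign}\left( \sum_{j=1}^{m} \sigma_{(i-1)m+j} \right) \ge 0 \right\}$, $h_R$
is the function in $\mathcal{H}$ that $\epsilon$-shatters the set $R$ and $h_{\bar{R}}$ is the function that shatters the complement of the set $R$. Hence

\[
\begin{aligned}
\mathcal{R}_{n'}(\mathcal{H}) &\ge \frac{1}{2} \mathbb{E}_{\sigma \sim \mathrm{Unif}\{\pm 1\}^{n'}} \left[ \frac{1}{n'} \sum_{i=1}^{\mathrm{fat}_\epsilon} \left| \sum_{j=1}^{m} \sigma_{(i-1)m+j} \right| 2\epsilon \right] \\
&= \frac{\epsilon}{n'} \sum_{i=1}^{\mathrm{fat}_\epsilon} \mathbb{E}_{\sigma} \left| \sum_{j=1}^{m} \sigma_{(i-1)m+j} \right| \\
&\ge \frac{\epsilon\, \mathrm{fat}_\epsilon}{n'} \sqrt{\frac{m}{2}} \qquad \text{(Khintchine's inequality)} \\
&= \sqrt{\frac{\epsilon^2\, \mathrm{fat}_\epsilon}{2 n'}} .
\end{aligned}
\]

We can now conclude that:

\[ \mathrm{fat}_\epsilon \le \frac{2 n'\, \mathcal{R}_{n'}^2(\mathcal{H})}{\epsilon^2} \le \frac{4 n\, \mathcal{R}_n^2(\mathcal{H})}{\epsilon^2} \]

where the last inequality holds because the Rademacher complexity decreases as the number of samples increases and
$n \le n' \le 2n$ (because $\epsilon > \mathcal{R}_n(\mathcal{H})$, which implies that $\mathrm{fat}_\epsilon < n$). □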
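The Khintchine step in the proof uses $\mathbb{E}\left|\sum_{j=1}^{m} \sigma_j\right| \ge \sqrt{m/2}$ for i.i.d. uniform signs. As a small illustration (not part of the original argument), the expectation can be computed exactly from the binomial distribution and compared with $\sqrt{m/2}$; the function name below is ours.

```python
from math import comb, sqrt

def expected_abs_rademacher_sum(m):
    """Exact E|sigma_1 + ... + sigma_m| for i.i.d. uniform +/-1 signs."""
    # With k plus-signs the sum equals 2k - m, and k is Binomial(m, 1/2).
    return sum(comb(m, k) * abs(2 * k - m) for k in range(m + 1)) / 2 ** m

for m in (1, 2, 3, 4, 10, 50):
    print(m, expected_abs_rademacher_sum(m), sqrt(m / 2))
```

The bound is attained with equality at $m = 2$, which is why the constant $1/\sqrt{2}$ cannot be improved in general.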



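Dudley Upper Bound:

In order to state and prove the following Lemma, we shall find it simpler to use the empirical Rademacher
complexity for a given sample $x_1, \ldots, x_n$ Q:

\[ \hat{R}_n(\mathcal{H}) = \mathbb{E}_{\sigma \sim \mathrm{Unif}(\{\pm 1\}^n)} \left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \left| \sum_{i=1}^{n} h(x_i) \sigma_i \right| \right] \qquad (22) \]

and the $L_2$ covering number at scale $\epsilon > 0$ specific to a sample $x_1, \ldots, x_n$, denoted by $N_2(\epsilon, \mathcal{F}, (x_1, \ldots, x_n))$,
defined as the size of a minimal cover $C_\epsilon$ such that

\[ \forall f \in \mathcal{F},\ \exists f_\epsilon \in C_\epsilon \ \text{s.t.}\ \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - f_\epsilon(x_i) \right)^2} \le \epsilon . \]

We will also denote $\hat{\mathbb{E}}[f^2] = \frac{1}{n} \sum_{i=1}^{n} f^2(x_i)$.

The following Lemma is stated in terms of the empirical Rademacher complexity and covering numbers.
Taking a supremum over samples of size $n$, we get the same relationship between the worst-case Rademacher
complexity and covering numbers, as is used in Section [5].

Lemma A.3 ([27] following pH]). For any function class $\mathcal{F}$ containing functions $f : \mathcal{X} \to \mathbb{R}$, we have that

\[ \hat{R}_n(\mathcal{F}) \le \inf_{\alpha \ge 0} \left\{ 4\alpha + 10 \int_{\alpha}^{\sup_{f \in \mathcal{F}} \sqrt{\hat{\mathbb{E}}[f^2]}} \sqrt{\frac{\log N_2(\epsilon, \mathcal{F}, (x_1, \ldots, x_n))}{n}}\, d\epsilon \right\} . \]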


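Proof. Let $\beta_0 = \sup_{f \in \mathcal{F}} \sqrt{\hat{\mathbb{E}}[f^2]}$ and for any $j \in \mathbb{Z}_+$ let $\beta_j = 2^{-j} \sup_{f \in \mathcal{F}} \sqrt{\hat{\mathbb{E}}[f^2]}$. The basic trick here is
the idea of chaining. For each $j$ let $T_j$ be a (proper) $L_2$-cover at scale $\beta_j$ of $\mathcal{F}$ for the given sample. For each
$f \in \mathcal{F}$ and $j$, pick an $\hat{f}_j \in T_j$ such that $\hat{f}_j$ is a $\beta_j$-approximation of $f$. Now for any $N$, we express $f$ by
chaining as

\[ f = f - \hat{f}_N + \sum_{j=1}^{N} \left( \hat{f}_j - \hat{f}_{j-1} \right) \]

where $\hat{f}_0 = 0$. Hence for any $N$ we have that

\[
\begin{aligned}
\hat{R}_n(\mathcal{F}) &= \frac{1}{n} \mathbb{E}_\sigma \left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{n} \sigma_i \left( f(x_i) - \hat{f}_N(x_i) + \sum_{j=1}^{N} \left( \hat{f}_j(x_i) - \hat{f}_{j-1}(x_i) \right) \right) \right| \right] \\
&\le \frac{1}{n} \mathbb{E}_\sigma \left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{n} \sigma_i \left( f(x_i) - \hat{f}_N(x_i) \right) \right| \right] + \sum_{j=1}^{N} \frac{1}{n} \mathbb{E}_\sigma \left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{n} \sigma_i \left( \hat{f}_j(x_i) - \hat{f}_{j-1}(x_i) \right) \right| \right] \\
&\le \frac{1}{n} \|\sigma\| \sup_{f \in \mathcal{F}} \sqrt{\sum_{i=1}^{n} \left( f(x_i) - \hat{f}_N(x_i) \right)^2} + \sum_{j=1}^{N} \frac{1}{n} \mathbb{E}_\sigma \left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{n} \sigma_i \left( \hat{f}_j(x_i) - \hat{f}_{j-1}(x_i) \right) \right| \right] \\
&\le \beta_N + \sum_{j=1}^{N} \frac{1}{n} \mathbb{E}_\sigma \left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{n} \sigma_i \left( \hat{f}_j(x_i) - \hat{f}_{j-1}(x_i) \right) \right| \right]
\end{aligned}
\]

where the step before last is due to the Cauchy-Schwarz inequality and $\sigma = [\sigma_1, \ldots, \sigma_n]^\top$. Now note that

\[
\begin{aligned}
\frac{1}{n} \sum_{i=1}^{n} \left( \hat{f}_j(x_i) - \hat{f}_{j-1}(x_i) \right)^2 &= \frac{1}{n} \sum_{i=1}^{n} \left( \left( \hat{f}_j(x_i) - f(x_i) \right) + \left( f(x_i) - \hat{f}_{j-1}(x_i) \right) \right)^2 \\
&\le \frac{2}{n} \sum_{i=1}^{n} \left( \hat{f}_j(x_i) - f(x_i) \right)^2 + \frac{2}{n} \sum_{i=1}^{n} \left( f(x_i) - \hat{f}_{j-1}(x_i) \right)^2 \\
&\le 2\beta_j^2 + 2\beta_{j-1}^2 = 6\beta_j^2 . \qquad (23)
\end{aligned}
\]

Now Massart's finite class lemma [19] states that if for any function class $\mathcal{G}$, $\sup_{g \in \mathcal{G}} \sqrt{\frac{1}{n} \sum_{i=1}^{n} g(x_i)^2} \le R$,
then $\hat{R}_n(\mathcal{G}) \le \sqrt{\frac{2 R^2 \log(|\mathcal{G}|)}{n}}$. Applying this to the function classes $\{f - f' : f \in T_j, f' \in T_{j-1}\}$ (for each $j$) we
get from Equation (23) that for any $N$,

\[
\begin{aligned}
\hat{R}_n(\mathcal{F}) &\le \beta_N + \sum_{j=1}^{N} \beta_j \sqrt{\frac{12 \log(|T_j|\, |T_{j-1}|)}{n}} \\
&\le \beta_N + \sum_{j=1}^{N} \beta_j \sqrt{\frac{24 \log |T_j|}{n}} \\
&\le \beta_N + 10 \sum_{j=1}^{N} (\beta_j - \beta_{j+1}) \sqrt{\frac{\log |T_j|}{n}} \\
&\le \beta_N + 10 \sum_{j=1}^{N} (\beta_j - \beta_{j+1}) \sqrt{\frac{\log N_2(\beta_j, \mathcal{F}, (x_1, \ldots, x_n))}{n}} \\
&\le \beta_N + 10 \int_{\beta_{N+1}}^{\beta_0} \sqrt{\frac{\log N_2(\epsilon, \mathcal{F}, (x_1, \ldots, x_n))}{n}}\, d\epsilon
\end{aligned}
\]

where the third step is because $2(\beta_j - \beta_{j+1}) = \beta_j$ and we bounded $\sqrt{24}$ by 5. Now for any $\alpha > 0$, pick
$N = \sup\{j : \beta_j > 2\alpha\}$. In this case we see that by our choice of $N$, $\beta_{N+1} \le 2\alpha$ and so $\beta_N = 2\beta_{N+1} \le 4\alpha$.
Also note that since $\beta_N > 2\alpha$, $\beta_{N+1} = \frac{\beta_N}{2} > \alpha$. Hence we conclude that

\[ \hat{R}_n(\mathcal{F}) \le 4\alpha + 10 \int_{\alpha}^{\sup_{f \in \mathcal{F}} \sqrt{\hat{\mathbb{E}}[f^2]}} \sqrt{\frac{\log N_2(\epsilon, \mathcal{F}, (x_1, \ldots, x_n))}{n}}\, d\epsilon . \]

Since the choice of $\alpha$ was arbitrary we take an infimum over $\alpha$. □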
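To see how the bound of Lemma A.3 behaves for a concrete covering-number profile, the right-hand side can be evaluated numerically. The sketch below is illustrative only: `log_cover` is a user-supplied function standing in for $\log N_2(\epsilon, \mathcal{F}, (x_1, \ldots, x_n))$, the infimum is approximated over a finite list of candidate values of $\alpha$, and the integral uses a simple midpoint rule. The profile $\log N_2 = R/\epsilon^2$ in the usage example is a hypothetical choice.

```python
import math

def dudley_bound(log_cover, beta0, n, alphas, steps=2000):
    """Approximate min over alphas of 4*alpha + 10 * int_alpha^beta0 sqrt(log N(eps)/n) d eps.

    log_cover(eps) should return the log covering number at scale eps;
    beta0 plays the role of sup_f sqrt(E_hat[f^2]).
    """
    best = float("inf")
    for alpha in alphas:
        if alpha >= beta0:
            # Empty integration range: only the 4*alpha term remains.
            best = min(best, 4.0 * alpha)
            continue
        width = (beta0 - alpha) / steps
        integral = sum(
            math.sqrt(max(log_cover(alpha + (k + 0.5) * width), 0.0) / n)
            for k in range(steps)
        ) * width
        best = min(best, 4.0 * alpha + 10.0 * integral)
    return best

# Hypothetical profile log N(eps) = R / eps^2 with R = 1, n = 100, beta0 = 1.
R, n = 1.0, 100
value = dudley_bound(lambda eps: R / eps ** 2, beta0=1.0, n=n, alphas=[0.1])
closed_form = 4 * 0.1 + 10 * math.sqrt(R / n) * math.log(1.0 / 0.1)
print(value, closed_form)
```

For this profile the integral equals $\sqrt{R/n}\,\log(\beta_0/\alpha)$, which is why a strictly positive $\alpha$ is needed: the integral diverges as $\alpha \to 0$.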
Detailed Proof of Main Result:

Detailed proof of Theorem [1]. By Theorem 6.1 of [(J (specifically the displayed equation prior to the
last one in the proof of the theorem) we have that if $\psi_n$ is any sub-root function that satisfies, for all $r > 0$,
$\mathcal{R}_n(\mathcal{L}_\phi(r)) \le \psi_n(r)$, then

\[ L(h) \le \hat{L}(h) + 45 r_n^* + \sqrt{L(h)} \left( \sqrt{8 r_n^*} + \sqrt{\frac{4b(\log(1/\delta) + 6 \log\log n)}{n}} \right) + \frac{20 b(\log(1/\delta) + 6 \log\log n)}{n} \qquad (24) \]

where $r_n^*$ is the largest solution to the equation $\psi_n(r) = r$. Now by Lemma [2.2] we have that $\psi_n(r) =
56 \sqrt{Hr}\, \log^{1.5} n\, \mathcal{R}_n(\mathcal{H})$ satisfies the property that for all $r > 0$, $\mathcal{R}_n(\mathcal{L}_\phi(r)) \le \psi_n(r)$, and so using this we
see that

\[ r_n^* = 56^2 H \log^3 n\, \mathcal{R}_n^2(\mathcal{H}) \]

and for this $r_n^*$ Equation (24) holds. Now using the simple fact that for any non-negative $A, B, C$,

\[ A \le B + C\sqrt{A} \ \Rightarrow\ A \le B + C^2 + \sqrt{B}\, C \]

we conclude,



Lh < L(h) + 106 r* + (log \ + log log n) + J L(h) ^8r* + (log \ + log log n) 
Plugging in $r_n^*$ we get the required statement. □